Compressed-Domain Pattern Matching with the Burrows-Wheeler Transform
Abstract
This report investigates two approaches to online pattern matching in files compressed with the Burrows-Wheeler transform (Burrows & Wheeler 1994). The first is based on the Boyer-Moore pattern matching algorithm (Boyer & Moore 1977), and the second is based on binary search. Both methods exploit the special structure of the Burrows-Wheeler transform to give efficient, robust pattern matching that works on files that have been only partly decompressed. Experimental results show that both methods are considerably faster than a decompress-and-search approach for most applications, with binary search being faster than Boyer-Moore at the expense of increased memory usage. The binary search in particular is closely related to efficient indexing strategies such as binary trees, and suggests a number of new applications of the Burrows-Wheeler transform in data storage and retrieval.
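The binary-search idea can be illustrated with a minimal Python sketch. This is not the report's implementation; it only shows why the transform supports binary search without full decompression. It assumes a sentinel character `"\0"` that does not occur in the text, and the names `bwt`, `build_index`, `row_prefix`, and `count_matches` are hypothetical helpers introduced here. The sorted rotations are never stored: the first few characters of any row are decoded on demand from the transformed string, and those prefixes are compared against the pattern.

```python
from typing import List, Tuple

SENTINEL = "\0"  # assumption: this character never occurs in the text


def bwt(text: str) -> str:
    """Naive forward transform: last column of the sorted rotations."""
    s = text + SENTINEL
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def build_index(last: str) -> Tuple[List[str], List[int]]:
    """Derive the first column and a 'rotate left' mapping from the BWT string.

    A stable sort of the positions of `last` by character gives, for each
    sorted row j, the row fl[j] that holds the same rotation shifted left by one.
    """
    fl = sorted(range(len(last)), key=lambda i: last[i])  # sorted() is stable
    first = [last[i] for i in fl]
    return first, fl


def row_prefix(first: List[str], fl: List[int], row: int, m: int) -> str:
    """Decode only the first m characters of sorted rotation `row`."""
    chars = []
    for _ in range(m):
        chars.append(first[row])
        row = fl[row]
    return "".join(chars)


def count_matches(last: str, pattern: str) -> int:
    """Count occurrences of `pattern` by binary search over the sorted rows."""
    n, m = len(last), len(pattern)
    if m == 0 or m >= n:  # pattern is empty or longer than the original text
        return 0
    first, fl = build_index(last)

    # Lower bound: first row whose m-character prefix is >= pattern.
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if row_prefix(first, fl, mid, m) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first row whose m-character prefix is > pattern.
    hi = n
    while lo < hi:
        mid = (lo + hi) // 2
        if row_prefix(first, fl, mid, m) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - start
```

Under these assumptions, `count_matches(bwt("mississippi"), "ssi")` returns 2. Each comparison decodes at most `len(pattern)` characters, so the search touches only a small part of the text, while the two auxiliary arrays account for the extra memory that a binary-search approach trades for speed.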
Similar resources
Approximate Pattern Matching Using the Burrows-Wheeler Transform
The compressed pattern matching problem is to locate the occurrence(s) of a pattern P in a text string T, using a compressed representation of T, with minimal (or no) decompression. In this paper, we consider approximate pattern matching on the text transformed by the Burrows-Wheeler Transform (BWT). This is an important first step towards developing compressed pattern matching algorithm for BW...
Approximate Pattern Matching Over the Burrows-Wheeler Transformed Text
The compressed pattern matching problem is to locate the occurrence(s) of a pattern P in a text string T using a compressed representation of T, with minimal (or no) decompression. In this paper, we consider approximate pattern matching directly on Burrows-Wheeler transformed (BWT) text which is a critical step for a fully compressed pattern matching algorithm on a BWT based compression algorit...
Entropy-Compressed Indexes for Multidimensional Pattern Matching
In this talk, we will discuss the challenges involved in developing multidimensional generalizations of compressed text indexing structures. These structures depend on some notion of Burrows-Wheeler transform (BWT) for multiple dimensions, though naive generalizations do not enable multidimensional pattern matching. We study the 2D case to possibly highlight combinatorial properties that do n...
A Comparison of BWT Approaches to Compressed-Domain Pattern Matching
A number of algorithms have recently been developed to search files compressed with the Burrows-Wheeler Transform (BWT) without the need for full decompression first. This allows the storage requirement of data to be reduced through the exceptionally good compression offered by BWT, while still allowing fast access to the information for searching. We provide a detailed description of five of t...
On Entropy-Compressed Text Indexing in External Memory
A new trend in the field of pattern matching is to design indexing data structures which take space very close to that required by the indexed text (in entropy-compressed form) and also simultaneously achieve good query performance. Two popular indexes, namely the FM-index [Ferragina and Manzini, 2005] and the CSA [Grossi and Vitter 2005], achieve this goal by exploiting the Burrows-Wheeler tra...